31 research outputs found

The DeepZen Speech Synthesis System for Blizzard Challenge 2023

Author: Maia Ranniery
Papandreou Spyridoula
Veaux Christophe
Publication date: 01/09/2023

This paper describes the DeepZen text-to-speech (TTS) system for Blizzard Challenge 2023. The goal of this challenge is to synthesise natural and high-quality speech in French, from a large monospeaker dataset (hub task) and from a smaller dataset by speaker adaptation (spoke task). We participated in both tasks with the same model architecture. Our approach has been to use an auto-regressive model, which retains an advantage for generating natural-sounding speech, but to improve prosodic control in several ways. Similarly to non-attentive Tacotron, the model uses a duration predictor and Gaussian upsampling at inference, but with a simpler unsupervised training. We also model the speaking style at both sentence and word levels by extracting global and local style tokens from the reference speech. At inference, the global and local style tokens are predicted from a BERT model run on text. This BERT model is also used to predict specific pronunciation features like schwa elision and optional liaisons. Finally, a modified version of HifiGAN trained on a large public dataset and fine-tuned on the target voices is used to generate the speech waveform. Our team is identified as O in the Blizzard evaluation, and MUSHRA test results show that our system ranks second ex aequo in both the hub task (median score of 0.75) and the spoke task (median score of 0.68), out of 18 and 14 participants, respectively. Comment: Blizzard Challenge 202

arXiv.org e-Print Archive

An Excitation Model for HMM-Based Speech Synthesis Based on Residual Modeling

Author: Heiga Zen
Keiichi Tokuda
Ranniery Maia
Tomoki Toda
Yoshihiko Nankaku
Publication date: 01/08/2007

SSW6: 6th ISCA Speech Synthesis Workshop, August 22-24, 2007, Bonn, Germany. This paper describes a trainable excitation approach to eliminate the unnaturalness of HMM-based speech synthesizers. During the waveform generation part, mixed excitation is constructed by state-dependent filtering of pulse trains and white noise sequences. In the training part, filters and pulse trains are jointly optimized through a procedure which resembles analysis-by-synthesis speech coding algorithms, where likelihood maximization of residual signals (derived from the same database which is used to train the HMM-based synthesizer) is pursued. Preliminary results show that the novel excitation model in question eliminates the unnaturalness of synthesized speech, being comparable in quality to the best approaches thus far reported to eradicate the buzziness of HMM-based synthesizers.

NAIST Academic Repository

A fixed dimension and perceptually based dynamic sinusoidal model of speech

Author: Hu Qiong
Latorre Javier
Maia Ranniery
Richmond Korin
Stylianou Y.
Yamagishi Junichi
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/05/2014

This paper presents a fixed- and low-dimensional, perceptually based dynamic sinusoidal model of speech referred to as PDM (Perceptual Dynamic Model). To decrease and fix the number of sinusoidal components typically used in the standard sinusoidal model, we propose to use only one dynamic sinusoidal component per critical band. For each band, the sinusoid with the maximum spectral amplitude is selected and associated with the centre frequency of that critical band. The model is expanded at low frequencies by incorporating sinusoids at the boundaries of the corresponding bands, while at the higher frequencies a modulated noise component is used. A listening test is conducted to compare speech reconstructed with PDM and state-of-the-art models of speech, where all models are constrained to use an equal number of parameters. The results show that PDM is clearly preferred in terms of quality over the other systems. Index Terms: sinusoidal model, critical band, vocoder

CiteSeerX

Crossref

Edinburgh Research Explorer

An investigation of the application of dynamic sinusoidal models to statistical parametric speech synthesis

Author: Hu Qiong
Latorre Javier
Maia Ranniery
Richmond Korin
Stylianou Yannis
Yamagishi Junichi
Publication date: 01/01/2014

speech synthesis

CiteSeerX

Edinburgh Research Explorer

Fusion of multiple parameterisations for DNN-based sinusoidal speech synthesis with multi-task learning

Author: Hu Qiong
Maia Ranniery
Richmond Korin
Stylianou Yannis
Wu Zhizheng
Yamagishi Junichi
Publication date: 01/09/2015

Edinburgh Research Explorer

LoTuS: uma Ferramenta Gráfica Extensível para Modelagem, Análise e Verificação de Modelos LTS e PLTS

Author: Barbosa Bruno
Correia Emerson
Filho Messias
Jesuíno Ranniery
Maia Paulo Henrique Mendes
Vieira Lucas
Publication venue: Revista Eletrônica de Iniciação Científica em Computação
Publication date: 05/02/2018

This article presents LoTuS, a tool for graphical modeling, analysis, and verification of software behavior using LTS and PLTS. Its main contributions are: easing the formal modeling process through a drag-and-drop mechanism that supports creating both non-probabilistic and probabilistic models; allowing models to be generated from other sources, such as UML sequence diagrams or execution traces; providing a set of model analysis techniques, such as simulation, execution, deadlock detection, and probabilistic verification of reachability properties; and, finally, offering an API so that developers can add new functionality by creating plugins. The tool was evaluated in terms of its usability and performance, and through a case study in which its main features were exercised.

Em Questao

Archives of the Faculty of Veterinary Medicine UFRGS

Intelligibility enhancement of HMM-generated speech in additive noise by modifying Mel cepstral coefficients to increase the glimpse proportion

Author: Cassia Valentini-Botinhao
Junichi Yamagishi
Ranniery Maia
Simon King
Publication venue: 'Elsevier BV'
Publication date: 01/03/2014

Crossref

Edinburgh Research Explorer

On the State Definition for a Trainable Excitation Model in HMM-based Speech Synthesis

Author: Keiichi Tokuda
Ranniery Maia
Satoshi Nakamura
Shinsuke Sakai
Tomoki Toda
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 02/03/2023

Institutional Repositories DataBase (IRDB)

An Excitation Model for HMM-Based Speech Synthesis Based on Residual Modeling

Author: Heiga Zen
Keiichi Tokuda
Ranniery Maia
Tomoki Toda
Yoshihiko Nankaku
Publication date: 02/03/2023

Institutional Repositories DataBase (IRDB)